ABSTRACT

Compressed Counting (CC) was recently proposed for
approximating the $\alpha$th frequency moments of data streams,
for $0 < \alpha \le 2$. Under the relaxed strict-Turnstile model,
CC dramatically improves on the standard algorithm based on
symmetric stable random projections, especially as $\alpha \to 1$. A
direct application of CC is to estimate the entropy, which is
an important summary statistic in Web/network measurement
and often serves as a crucial "feature" for data mining.
The Renyi entropy and the Tsallis entropy are functions of
the $\alpha$th frequency moments, and both approach the Shannon
entropy as $\alpha \to 1$. A recent theoretical work suggested
using the $\alpha$th frequency moment to approximate the Shannon
entropy with $\alpha = 1 + \delta$ and very small $|\delta|$ (e.g., $< 10^{-4}$).

In this study, we experiment using CC to estimate fre- 
quency moments, Renyi entropy, Tsallis entropy, and Shan- 
non entropy, on real Web crawl data. We demonstrate the 
variance-bias trade-off in estimating Shannon entropy and 
provide practical recommendations. In particular, our ex- 
periments enable us to draw some important conclusions: 

• As $\alpha \to 1$, CC dramatically improves on symmetric stable
random projections in estimating frequency moments,
Renyi entropy, Tsallis entropy, and Shannon entropy.
The improvements appear to approach "infinity."

• CC is a highly practical algorithm for estimating Shannon
entropy (from either Renyi or Tsallis entropy) with
$\alpha \approx 1$. Only a very small sample (e.g., 20) is needed
to achieve a high accuracy (e.g., < 1% relative errors).

• Using symmetric stable random projections and $\alpha =
1 + \delta$ with very small $|\delta|$ does not provide a practical
algorithm because the required sample size is enormous.

• If we do need to use symmetric stable random projections
for estimating Shannon entropy, we should exploit
the variance-bias trade-off by letting $\alpha$ be away
from 1, for much better performance.

• Even in terms of the best achievable performance in
estimating Shannon entropy, CC still considerably improves
on symmetric stable random projections, by one or
two orders of magnitude, both in terms of the estimation
accuracy and the required sample size (storage space).



1. INTRODUCTION 

The general theme of "scaling up for high dimensional data
and high speed data streams" is among the "ten challenging
problems in data mining research" [33]. This paper focuses
on a very efficient algorithm for estimating the entropy of
data streams using a recently developed randomized algorithm
called Compressed Counting (CC) by Li [23, 21, 24].
The underlying technique of CC is maximally-skewed stable
random projections. Our experiments on real Web crawl
data demonstrate that CC can approximate entropy with
very high accuracy. In particular, CC (dramatically) improves
on symmetric stable random projections (Indyk [18] and
Li [22]) for estimating entropy, under the relaxed
strict-Turnstile model.

1.1 Data Streams and Relaxed Strict-Turnstile Model

While traditional machine learning and mining algorithms
often assume static data, in reality, data are often constantly
updated. Mining data streams [T51l3l[T1[27] in (e.g.,) 100 TB
scale databases has become an important area of research, as
network data can easily reach that scale [33]. Search engines
are a typical source of data streams (Babcock et al. [3]).

We consider the Turnstile stream model (Muthukrishnan
[27]). The input stream $a_t = (i_t, I_t)$, $i_t \in [1, D]$, arriving
sequentially describes the underlying signal $A$, meaning

$$A_t[i_t] = A_{t-1}[i_t] + I_t, \tag{1}$$

where the increment $I_t$ can be either positive (insertion) or
negative (deletion). For example, in an online bookstore,
$A_{t-1}[i]$ may record the total number of books that user $i$
has ordered up to time $t-1$, and $I_t$ denotes the number of
books that this user orders ($I_t > 0$) or cancels ($I_t < 0$) at $t$.

It is often reasonable to assume $A_t[i] \ge 0$, although $I_t$ may
be either negative or positive. Restricting $A_t[i] \ge 0$ results
in the strict-Turnstile model, which suffices for describing
almost all natural phenomena. For example, in an online
store, it is not possible to cancel orders that do not exist.

Compressed Counting (CC) assumes a relaxed strict-Turnstile
model by only enforcing $A_t[i] \ge 0$ at the $t$ one cares
about. At other times $s \ne t$, CC allows $A_s[i]$ to be arbitrary.
This is more general than the strict-Turnstile model.
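
The update rule (1) and the relaxed non-negativity condition can be sketched in a few lines of Python. This is only an illustration: the function names and the toy bookstore stream are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def apply_stream(updates):
    """Apply Turnstile updates (i_t, I_t) and return the signal A_t as a dict."""
    A = defaultdict(int)
    for i_t, I_t in updates:
        A[i_t] += I_t  # A_t[i_t] = A_{t-1}[i_t] + I_t
    return dict(A)

def satisfies_relaxed_strict_turnstile(A):
    """Check A_t[i] >= 0 for all i, but only at the time t being evaluated."""
    return all(value >= 0 for value in A.values())

# Hypothetical stream: item 7 receives a deletion before its insertion, so an
# intermediate state is negative while the final state is non-negative.
updates = [(3, 2), (7, -1), (7, 4), (3, -1)]
A_t = apply_stream(updates)
```

Here the state after the first two updates violates non-negativity, yet the state at the end of the stream does not, which is exactly the situation the relaxed model permits.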

1.2 Moments and Entropies of Data Streams 

The $\alpha$th frequency moment is a fundamental statistic:

$$F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}. \tag{2}$$

When $\alpha = 1$, $F_{(1)}$ is the sum of the stream. It is obvious that
one can compute $F_{(1)}$ exactly and trivially using a simple
counter, because $F_{(1)} = \sum_{i=1}^{D} A_t[i] = \sum_{s=0}^{t} I_s$.

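$A_t$ is basically a histogram and we can view $p_i = \frac{A_t[i]}{\sum_{i=1}^{D} A_t[i]}$
as probabilities. An extremely useful (especially in Web and
networks [35, 26]) summary statistic is the Shannon entropy:

$$H = -\sum_{i=1}^{D} \frac{A_t[i]}{F_{(1)}} \log \frac{A_t[i]}{F_{(1)}}, \quad \text{where } F_{(1)} = \sum_{i=1}^{D} A_t[i]. \tag{3}$$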

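Various generalizations of the Shannon entropy exist. The
Renyi entropy [25], denoted by $H_\alpha$, is defined as

$$H_\alpha = \frac{1}{1-\alpha} \log \frac{\sum_{i=1}^{D} A_t[i]^{\alpha}}{\left(\sum_{i=1}^{D} A_t[i]\right)^{\alpha}} = \frac{1}{1-\alpha} \log \frac{F_{(\alpha)}}{F_{(1)}^{\alpha}}. \tag{4}$$

The Tsallis entropy [12, 32], denoted by $T_\alpha$, is defined as

$$T_\alpha = \frac{1}{\alpha-1} \left(1 - \frac{F_{(\alpha)}}{F_{(1)}^{\alpha}}\right), \tag{5}$$

which was first introduced by Havrda and Charvat [12] and
later popularized by Tsallis [32].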

It is easy to verify that, as $\alpha \to 1$, both the Renyi entropy
and Tsallis entropy converge to the Shannon entropy. Thus
$H = H_1 = T_1$ in the limit sense. For this fact, one can also
consult http://en.wikipedia.org/wiki/Renyi_entropy
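
As a quick numerical check of this limit, the following sketch computes the frequency moment, the Shannon entropy, and the Renyi and Tsallis entropies exactly from a small count vector, using the standard definitions in terms of $F_{(\alpha)}$ and $F_{(1)}$. The function names and the toy vector are illustrative assumptions, not part of the paper.

```python
import numpy as np

def frequency_moment(A, alpha):
    """F_(alpha) = sum_i A[i]^alpha over the non-zero entries."""
    A = np.asarray(A, dtype=float)
    A = A[A > 0]
    return np.sum(A ** alpha)

def shannon_entropy(A):
    """H = -sum_i p_i log p_i with p_i = A[i] / F_(1)."""
    A = np.asarray(A, dtype=float)
    A = A[A > 0]
    p = A / A.sum()
    return -np.sum(p * np.log(p))

def renyi_entropy(A, alpha):
    """H_alpha = log(F_(alpha) / F_(1)^alpha) / (1 - alpha)."""
    F1 = frequency_moment(A, 1.0)
    return np.log(frequency_moment(A, alpha) / F1 ** alpha) / (1.0 - alpha)

def tsallis_entropy(A, alpha):
    """T_alpha = (1 - F_(alpha) / F_(1)^alpha) / (alpha - 1)."""
    F1 = frequency_moment(A, 1.0)
    return (1.0 - frequency_moment(A, alpha) / F1 ** alpha) / (alpha - 1.0)

A = np.array([5, 3, 1, 1])  # toy count vector
H = shannon_entropy(A)
H_near_1 = renyi_entropy(A, 1.0 + 1e-6)
T_near_1 = tsallis_entropy(A, 1.0 + 1e-6)
```

Both `H_near_1` and `T_near_1` agree with `H` to several decimal places, while values of $\alpha$ further from 1 bracket `H` from above ($\alpha < 1$) and below ($\alpha > 1$).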

Therefore, both the Renyi entropy and Tsallis entropy can
be computed from the $\alpha$th frequency moment; and one can
approximate the Shannon entropy from either $H_\alpha$ or $T_\alpha$ by
using $\alpha \approx 1$. In fact, several studies (Zhao et al. [35] and
Harvey et al. [10, 11]) have used this idea to approximate the
Shannon entropy.

We should mention that [21] proposed estimating the logarithmic
moment, $\sum_{i=1}^{D} \log A_t[i]$, using $F_{(\alpha)}$ with $\alpha \to 0$.
Their idea is very similar to that in estimating entropy.

1.3 Challenges in Data Stream Computations 

Because the elements, $A_t[i]$, are time-varying, a naive
counting mechanism requires a system of $D$ counters to compute
$F_{(\alpha)}$ exactly (unless $\alpha = 1$). This is not always realistic
when $D$ is large and the data are frequently updated
at a very high rate. For example, if $A_t[i]$ records activities for
each user $i$, identified by his/her IP address, then potentially
$D = 2^{64}$ (possibly much larger in the near future).

Due to the huge volume, streaming data are often not
(fully) stored, even on disks [3]. One common strategy is
to store only a small "sample" of the data; and sampling has
become an important topic in Web search and data streams
[14ll4l[T3]. While some modern databases (e.g., Yahoo!'s
2-petabyte database) and government agencies do store the
whole data history, the data analysis often has to be conducted
on a (hopefully) representative small sample of the
data. As it is well understood that general-purpose simple
sampling-based methods often cannot give reliable approximation
guarantees [3], developing special-purpose (and one-pass)
sampling/sketching techniques in streaming data has
become an active area of research.

1.4 Previous Studies on Approximating Frequency Moments and Entropy

Pioneered by Alon et al. [2], the problem of approximating
$F_{(\alpha)}$ in data streams has been heavily studied [7ll7I 130ll5"lll9l
I33| [8]. The method of symmetric stable random projections
(Indyk [18], Li [22]) is regarded to be practical and accurate.

We have mentioned that computing the first moment $F_{(1)}$
in the strict-Turnstile model is trivial using a simple counter.
One might naturally speculate that when $\alpha \approx 1$, computing
(approximating) $F_{(\alpha)}$ should also be easy. However, none of
the previous algorithms, including symmetric stable random
projections, could capture this intuition. For example,
Figure 1 in Section 3 shows that the performance of symmetric
stable random projections is roughly the same for $\alpha = 1$ and
$\alpha \approx 1$, even though $\alpha = 1$ should be trivial.

Compressed Counting (CC) [23, 21, 24] was recently
proposed to overcome the drawback of previous algorithms
at $\alpha \approx 1$. CC improves on symmetric stable random projections
uniformly for all $0 < \alpha < 2$ and the improvement is in a
sense "infinite" when $\alpha \to 1$, as shown in Figure 1 in Section
3. However, no empirical studies on CC have been reported.

Zhao et al. [35] applied symmetric stable random projections
to approximate the Shannon entropy. [21] cited [35] as
one application of Compressed Counting (CC). A nice theoretical
paper in FOCS'08 by Harvey et al. [10, 11] provided
the criterion to choose the $\alpha$ so that the Shannon entropy can
be approximated with a guaranteed accuracy, using the $\alpha$th
frequency moment. [11] cited both symmetric stable random
projections [18, 22] and Compressed Counting [21].

There are other methods for estimating entropy, e.g., [9], 
which we do not compare with in this study. 

1.5 Summary of Our Contributions 

Our main contribution is the first empirical study of Com- 
pressed Counting for estimating entropy. Some theoretical 
analysis is also conducted. 

• We apply Compressed Counting (CC) to compute the 
Renyi entropy, the Tsallis entropy, and the Shannon 
entropy, on real Web crawl data. 

• We empirically compare CC with symmetric stable ran- 
dom projections and demonstrate the huge improve- 
ment. Thus, our work helps establish CC as a promis- 
ing practical tool in data stream computations. 

• We provide some theoretical analysis for approximat- 
ing entropy, for example, the variance-bias trade-off. 

• Our empirical work leads to practical recommendations
for various estimators developed in [23, 21, 24].

For estimating the Shannon entropy, the theoretical work
by Harvey et al. [10, 11] used symmetric stable random projections
or CC as a subroutine (a two-stage "black-box" approach).
That is, they first determined at what $\alpha = 1 + \delta$
value $H_\alpha$ (or $T_\alpha$) is close to $H$ within a required accuracy.
Then they used this chosen $\alpha$th frequency moment to approximate
the Shannon entropy, independent of whether the
frequency moments are estimated using CC or symmetric
stable random projections.

In comparison, we demonstrate that estimating Shannon
entropy is a variance-bias trade-off; and hence the performance
is highly coupled with the underlying estimators. The
two-stage "black-box" approach [10, 11] may have some theoretical
advantage (e.g., simplifying the analysis), while our
variance-bias analysis directly reflects the real-world situation
and leads to practical recommendations.



• [10, 11] let $\alpha = 1 + \delta$ and provided the procedures to
compute $\delta$ (or a series of $\delta$'s). If one actually carries
out the calculation, their $|\delta|$ is very small (like $10^{-4}$ or
smaller). Consequently their theoretically calculated
sample size may be (impractically) large, especially
when using symmetric stable random projections.

• In comparison, we provide a practical recommendation
for estimating Shannon entropy: using CC with $\alpha \approx
0.98 \sim 0.99$ and the optimal quantile estimator. Only
a small sample (e.g., 20) can achieve a high accuracy
(e.g., < 1% relative errors).

• We demonstrate that due to the variance-bias trade-off,
there will be an "optimal" $\alpha$ value that could attain
the best mean square errors for estimating the Shannon
entropy. This optimal $\alpha$ can be quite far from
1 when using symmetric stable random projections.

1.6 Organization 

Section 2 reviews some applications of entropy. The basic
methodologies of CC and various estimators for recovering
the $\alpha$th frequency moments are reviewed in Section 3. We
analyze in Section 4 the biases and variances in estimating
entropies. Experiments on real Web crawl data are presented
in Section 5. Finally, Section 6 concludes the paper.

2. SOME APPLICATIONS OF ENTROPY 

2.1 The Shannon Entropy 

The Shannon entropy, $H$ defined in (3), is a fundamental
measure of randomness. A recent paper in WSDM'08
(Mei and Church [26]) was devoted to estimating the Shannon
entropy of MSN search logs, to help answer some basic
problems in Web search, such as, how big is the web?

The search logs can be naturally viewed as data streams,
although [26] only analyzed several "snapshots" of a sample
of MSN search logs. The sample used in [26] contained 10
million <Query, URL, IP> triples; each triple corresponded
to a click from a particular IP address on a particular URL
for a particular query. [26] drew their important conclusions
on this (hopefully) representative sample. We believe one
can (quite easily) apply Compressed Counting (CC) on the
same task, on the whole history of MSN (or other search
engines) search logs instead of a (static) sample.

Using the Shannon entropy as an important "feature" for
mining anomalies is a widely used technique (e.g., [HI]). In
IMC'07, Zhao et al. [35] applied symmetric stable random
projections to estimate the Shannon entropy for all
origin-destination (OD) flows in network measurement, for
clustering traffic and detecting traffic anomalies.

Detecting anomaly events in real-time (DDoS attacks, network
failures, etc.) is highly beneficial in monitoring network
performance degradation and service disruptions. Zhao
et al. [35] hoped to capture those events in real-time by examining
the entropy of every OD flow. They resorted to
approximate algorithms because measuring the Shannon entropy
in real-time is not possible on high-speed links due to
its memory requirements and high computational cost.

2.2 The Renyi Entropy 



The Renyi entropy, $H_\alpha$ defined in (4), is a generalization
of the classical Shannon entropy $H$. $H_\alpha$ is a function of the
frequency moment $F_{(\alpha)}$ and approaches $H$ as $\alpha \to 1$. Thus
it is natural to use $H_\alpha$ with $\alpha \approx 1$ to approximate $H$.

The Renyi entropy has other applications. It is a diversity
index in ecology [31129 25]. It is used for analyzing expander
graphs [15] and other applications, e.g., [37].

2.3 The Tsallis Entropy 

The Tsallis entropy, $T_\alpha$ defined in (5), is another generalization
of the Shannon entropy $H$. Since $T_\alpha \to H$ as $\alpha \to 1$,
the Tsallis entropy provides another algorithm for approximating
the Shannon entropy.

The Tsallis entropy is widely used in statistical physics
and mechanics. Interested readers may consult the link
www.cscs.umich.edu/~crshalizi/notabene/tsallis.html

3. REVIEW OF COMPRESSED COUNTING (CC)

Compressed Counting (CC) assumes the relaxed strict- 
Turnstile data stream model. Its underlying technique is 
based on maximally-skewed stable random projections. 

3.1 Maximally-Skewed Stable Distributions

A random variable $Z$ follows a maximally-skewed $\alpha$-stable
distribution if the Fourier transform of its density is [36]

$$\mathscr{F}_Z(t) = \mathrm{E} \exp\left(\sqrt{-1}\, Z t\right) = \exp\left(-F |t|^{\alpha} \left(1 - \sqrt{-1}\, \beta\, \mathrm{sign}(t) \tan\left(\frac{\pi\alpha}{2}\right)\right)\right),$$

where $0 < \alpha < 2$, $F > 0$, and $\beta = 1$. We denote $Z \sim
S(\alpha, \beta = 1, F)$. The skewness parameter $\beta$ for general stable
distributions ranges in $[-1, 1]$; but CC uses $\beta = 1$, i.e.,
maximally-skewed. Previously, the method of symmetric
stable random projections [18, 22] used $\beta = 0$.

Consider two independent variables, $Z_1, Z_2 \sim S(\alpha, \beta =
1, 1)$. For any non-negative constants $C_1$ and $C_2$, the
"$\alpha$-stability" follows from properties of Fourier transforms:

$$Z = C_1 Z_1 + C_2 Z_2 \sim S\left(\alpha, \beta = 1, C_1^{\alpha} + C_2^{\alpha}\right).$$

Note that if $\beta = 0$, then the above stability holds for
any constants $C_1$ and $C_2$. This is why symmetric stable
random projections [18, 22] can work on general data but
CC only works on non-negative data (i.e., the relaxed
strict-Turnstile model). Since we are interested in the entropy, the
non-negativity constraint is natural, because the probability
should be non-negative.

3.2 Random Projections 

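Conceptually, one can generate a matrix $R \in \mathbb{R}^{D \times k}$ and
multiply it with the data stream $A_t$, i.e., $X = R^{\mathsf{T}} A_t \in \mathbb{R}^{k}$.
The resultant vector $X$ is only of length $k$. The entries of $R$,
$r_{ij}$, are i.i.d. samples of a stable distribution $S(\alpha, \beta = 1, 1)$.

By properties of Fourier transforms, the entries of $X$, $x_j$,
$j = 1$ to $k$, are i.i.d. samples of a stable distribution

$$x_j = \left[R^{\mathsf{T}} A_t\right]_j = \sum_{i=1}^{D} r_{ij} A_t[i] \sim S\left(\alpha, \beta = 1, F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}\right), \tag{6}$$

whose scale parameter $F_{(\alpha)}$ is exactly the $\alpha$th frequency
moment of $A_t$.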



Therefore, CC boils down to a statistical estimation prob- 
lem. If we can estimate the scale parameter from k samples, 
we can then estimate the frequency moments and entropies. 

For real implementations, one should conduct $R^{\mathsf{T}} A_t$
incrementally. This is possible because the Turnstile model
(1) is a linear updating model. That is, for every incoming
$a_t = (i_t, I_t)$, we update $x_j \leftarrow x_j + r_{i_t j} I_t$ for $j = 1$ to $k$.
Entries of $R$ are generated on-demand as necessary.
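
The linearity argument can be illustrated with a short sketch that maintains the projected vector incrementally and compares it with the batch product $R^{\mathsf{T}} A_t$. The projection matrix below uses standard normal entries purely as a stand-in to show the update mechanics; CC itself would use entries from $S(\alpha, \beta = 1, 1)$, and the toy stream and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 50, 8

# Stand-in projection matrix (assumption): normal entries illustrate linearity
# only. CC would draw these entries from S(alpha, beta=1, 1) instead.
R = rng.standard_normal((D, k))

# Toy Turnstile stream of (i_t, I_t) pairs with zero-based item indices.
updates = [(3, 2.0), (7, -1.0), (7, 4.0), (3, -1.0), (41, 5.0)]

x = np.zeros(k)  # the sketch, updated incrementally
A = np.zeros(D)  # the full signal, kept here only to verify the result
for i_t, I_t in updates:
    x += R[i_t] * I_t  # x_j <- x_j + r_{i_t, j} * I_t for j = 1..k
    A[i_t] += I_t

x_batch = R.T @ A  # what one would get by projecting the final signal at once
```

In a real deployment only `x` (of length $k$) would be stored; the vector `A` appears here solely so that the incremental and batch results can be compared.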

3.3 The Efficiency in Processing Time 

Ganguly and Cormode [8] commented that, when $k$ is
large, generating entries of $R$ on-demand and the multiplications
$r_{i_t j} I_t$, $j = 1$ to $k$, can be prohibitive when data arrive
at a very high rate. This can be a drawback of stable random
projections. An easy "fix" is to use $k$ as small as possible.

At the same $k$, all procedures of CC and symmetric stable
random projections are the same except that the entries in $R$
follow different distributions. Thus, both methods have the
same efficiency in processing time at the same $k$. However,
since CC is much more accurate, especially when $\alpha \approx 1$,
it requires a much smaller $k$ for reaching a specified level of
accuracy. For example, while using symmetric stable random
projections with $k = 10000$ is prohibitive, using CC with
only $k = 20$ may be practically feasible. Therefore, CC
in a sense naturally provides a solution to the problem of
processing efficiency.

3.4 Three Statistical Estimators for CC 

In this study, we consider three estimators from [23, 21, 24],
which are promising for good performance near $\alpha = 1$.

Recall that CC boils down to estimating the scale parameter
$F_{(\alpha)}$ from $k$ i.i.d. samples $x_j \sim S(\alpha, \beta = 1, F_{(\alpha)})$.

3.4.1 The Geometric Mean Estimator 
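[Equation (7), defining the geometric mean estimator $\hat{F}_{(\alpha),gm}$, is only partly legible; the surviving fragment defines $\kappa(\alpha)$ piecewise, with one case for $\alpha < 1$ and $\kappa(\alpha) = 2 - \alpha$ if $\alpha > 1$.]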



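This estimator is strictly unbiased, i.e., $\mathrm{E}\left(\hat{F}_{(\alpha),gm}\right) = F_{(\alpha)}$,
and its asymptotic (i.e., as $k \to \infty$) variance is

$$\mathrm{Var}\left(\hat{F}_{(\alpha),gm}\right) = \begin{cases} \dfrac{F_{(\alpha)}^2}{k} \dfrac{\pi^2}{6} \left(1 - \alpha^2\right) + O\left(\dfrac{1}{k^2}\right), & \alpha < 1 \\[2ex] \dfrac{F_{(\alpha)}^2}{k} \dfrac{\pi^2}{6} (\alpha - 1)(5 - \alpha) + O\left(\dfrac{1}{k^2}\right), & \alpha > 1 \end{cases} \tag{8}$$

As $\alpha \to 1$, the asymptotic variance approaches zero.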

The geometric mean estimator is important for theoretical
analysis. For example, [21] showed that when $\alpha = 1 \pm \Delta \to 1$
(i.e., $\Delta \to 0$), the "constant" $G$ in its sample complexity
bound $k = O\left(G/\epsilon^2\right)$ satisfies $G \to \epsilon$ at the rate of $\sqrt{\Delta}$.
That is, as $\alpha \to 1$, the complexity becomes $k = O(1/\epsilon)$
instead of $O\left(1/\epsilon^2\right)$. Note that $O\left(1/\epsilon^2\right)$ is the well-known
large-deviation bound for symmetric stable random projections.
The sample complexity bound determines the sample
size $k$ needed for achieving a relative accuracy within a $1 \pm \epsilon$
factor of the truth.



In many theory papers, the "constants" in tail bounds are 
often ignored. The geometric mean estimator for CC demon- 
strates that in special cases the "constants" may be so small 
that they should not be treated as "constants" any more. 

3.4.2 The Harmonic Mean Estimator 



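$$\hat{F}_{(\alpha),hm} = \frac{k \dfrac{\cos\left(\frac{\alpha\pi}{2}\right)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k} |x_j|^{-\alpha}} \left(1 - \frac{1}{k}\left(\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1\right)\right), \tag{9}$$

which is asymptotically unbiased and has variance

$$\mathrm{Var}\left(\hat{F}_{(\alpha),hm}\right) = \frac{F_{(\alpha)}^2}{k} \left(\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1\right) + O\left(\frac{1}{k^2}\right). \tag{10}$$

$\hat{F}_{(\alpha),hm}$ is defined only for $\alpha < 1$ and is considerably more
accurate than the geometric mean estimator $\hat{F}_{(\alpha),gm}$.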
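
A minimal sketch of the harmonic mean estimator follows, written from the form $k \cos(\alpha\pi/2) / \left(\Gamma(1+\alpha) \sum_j |x_j|^{-\alpha}\right)$ with the $1/k$ bias-correction factor. To obtain test samples without a general stable sampler, it relies on the standard fact that $1/N^2$ with $N$ standard normal follows the Levy distribution, which is $S(\alpha = 1/2, \beta = 1, 1)$ in this parameterization; scaling by $c$ then gives $F = c^{\alpha}$. The function name, seed, and sample size are illustrative assumptions.

```python
import numpy as np
from math import gamma, cos, pi

def harmonic_mean_estimate(x, alpha):
    """Estimate F_(alpha) from samples x_j ~ S(alpha, beta=1, F), for alpha < 1."""
    x = np.abs(np.asarray(x, dtype=float))
    k = x.size
    ratio = 2.0 * gamma(1.0 + alpha) ** 2 / gamma(1.0 + 2.0 * alpha)
    leading = k * cos(alpha * pi / 2.0) / gamma(1.0 + alpha) / np.sum(x ** (-alpha))
    return leading * (1.0 - (ratio - 1.0) / k)  # 1/k bias-correction factor

rng = np.random.default_rng(1)
alpha, k, c = 0.5, 20000, 9.0
z = 1.0 / rng.standard_normal(k) ** 2  # Levy samples, i.e. S(1/2, beta=1, 1)
x = c * z                              # scaling by c gives F = c**alpha = 3
F_hat = harmonic_mean_estimate(x, alpha)
```

With these settings the variance formula predicts a relative standard deviation of roughly half a percent, so `F_hat` should land close to 3.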

3.4.3 The Optimal Quantile Estimator 



$$\hat{F}_{(\alpha),oq} = \left(\frac{q^*\text{-Quantile}\{|x_j|, j = 1, 2, ..., k\}}{W_\alpha}\right)^{\alpha}, \tag{11}$$

where

$$W_\alpha = q^*\text{-Quantile}\{|S(\alpha, \beta = 1, 1)|\}. \tag{12}$$

To compute $\hat{F}_{(\alpha),oq}$, one sorts $|x_j|$, $j = 1$ to $k$, and uses
the $q^*$th smallest, i.e., $q^*$-Quantile$\{|x_j|, j = 1, 2, ..., k\}$. $q^*$ is
chosen to minimize the asymptotic variance.

[24] provides the values for $q^*$, $W_\alpha$, as well as the asymptotic
variances. For convenience, we tabulate the values for
$\alpha \in [0.8, 1.2]$ in Table 1. The last column contains the
asymptotic variances (with $F_{(\alpha)} = 1$) without the $\frac{1}{k}$ factor.



Table 1: Values of $q^*$, $W_\alpha$, and the asymptotic variance for the optimal quantile estimator.

$\alpha$    $q^*$     $W_\alpha$    Var
0.80     0.108     2.256365     0.15465894
0.90     0.101     5.400842     0.04116676
0.95     0.098     11.74773     0.01059831
0.98     0.0944    30.82616     0.001724739
0.989    0.0941    56.86694     0.0005243589
1.011    0.8904    58.83961     0.0005554749
1.02     0.8799    32.76892     0.001901498
1.05     0.855     13.61799     0.01298757
1.10     0.827     7.206345     0.05717725
1.20     0.799     4.011459     0.2516604


Compared with the geometric mean and harmonic mean
estimators, $\hat{F}_{(\alpha),gm}$ and $\hat{F}_{(\alpha),hm}$, the optimal quantile
estimator $\hat{F}_{(\alpha),oq}$ has some noticeable advantages:

• When the sample size $k$ is not too small (e.g., $k \ge 50$),
$\hat{F}_{(\alpha),oq}$ is more accurate than $\hat{F}_{(\alpha),gm}$, especially for
$\alpha > 1$. It is also more accurate than $\hat{F}_{(\alpha),hm}$, when $\alpha$
is close to 1. Our experiments will verify this point.

• $\hat{F}_{(\alpha),oq}$ is computationally more efficient because both
$\hat{F}_{(\alpha),gm}$ and $\hat{F}_{(\alpha),hm}$ require $k$ fractional power operations,
which are expensive.

The drawbacks of the optimal quantile estimator are:

• For small samples (e.g., $k \le 20$), $\hat{F}_{(\alpha),oq}$ exhibits bad
behaviors when $\alpha > 1$.

• Its theoretical analysis, e.g., variances and tail bounds,
is based on the density function of skewed stable distributions,
which do not have closed forms.

• The parameters, $q^*$ and $W_\alpha$, are obtained from the
numerically-computed density functions. [24] provided
$q^*$ and $W_\alpha$ values for $\alpha \ge 1.011$ and $\alpha \le 0.989$.
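
The sort-and-select step of the optimal quantile estimator can be sketched as below, using three $(q^*, W_\alpha)$ pairs copied from Table 1. Two details are assumptions of this sketch rather than statements from the paper: the estimate is taken as the ratio of the sample quantile to $W_\alpha$ raised to the power $\alpha$ (consistent with the scale parameter of $cZ$ being $c^{\alpha}$), and the "$q^*$th smallest" is read as the $\lceil q^* k \rceil$-th order statistic.

```python
import math
import numpy as np

# (q*, W_alpha) pairs copied from Table 1 for a few alpha values.
TABLE1 = {
    0.95: (0.098, 11.74773),
    0.98: (0.0944, 30.82616),
    0.989: (0.0941, 56.86694),
}

def optimal_quantile_estimate(x, alpha):
    """Estimate F_(alpha) from the q*-quantile of |x_j| and the constant W_alpha."""
    q_star, W_alpha = TABLE1[alpha]
    abs_sorted = np.sort(np.abs(np.asarray(x, dtype=float)))
    # Assumption: use the ceil(q* k)-th smallest value as the sample quantile.
    idx = max(math.ceil(q_star * abs_sorted.size) - 1, 0)
    return (abs_sorted[idx] / W_alpha) ** alpha

# Deterministic toy input, used only to show the scaling behaviour.
z = np.arange(1.0, 101.0)
est_z = optimal_quantile_estimate(z, 0.98)
est_3z = optimal_quantile_estimate(3.0 * z, 0.98)
```

Scaling the samples by 3 multiplies the estimate by $3^{0.98}$, as expected for a scale parameter of the form $F_{(\alpha)}$. Note that only one fractional power is evaluated, regardless of $k$.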

3.4.4 The Geometric Mean Estimator for Symmetric 
Stable Random Projections 

For symmetric stable random projections, the following geometric
mean estimator is close to being statistically optimal
when $\alpha \approx 1$ [22]:

[Equations (13) and (14), giving the estimator $\hat{F}_{(\alpha),gm,sym}$ and its asymptotic variance, are only partly legible; the surviving variance term reads $\left(2 + \alpha^2\right) + O\left(\frac{1}{k^2}\right)$.]

where $x_j \sim S\left(\alpha, \beta = 0, F_{(\alpha)}\right)$.

Therefore, we only compare CC with this estimator, which
was explicitly used in [10, 11] for the task of residual moment
estimation for the general Turnstile model.

3.4.5 Comparisons of Asymptotic Variances 

Figure 1 compares the variances of the three estimators for
CC, as well as the geometric mean estimator for symmetric
stable random projections.

Figure 1: Let $\hat{F}$ be an estimator of $F$ with asymptotic
variance $\mathrm{Var}\left(\hat{F}\right) = V \frac{F^2}{k} + O\left(\frac{1}{k^2}\right)$. We plot the
$V$ values for the geometric mean estimator, the harmonic
mean estimator (for $\alpha < 1$), and the optimal
quantile estimator, along with the $V$ values for the
geometric mean estimator for symmetric stable random
projections in [22] ("symmetric GM").
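
Since the plot itself is not reproduced here, the comparison can be approximated numerically. The sketch below evaluates the variance factor $V$ under stated assumptions: the CC geometric mean factor is taken as $\frac{\pi^2}{6}(1-\alpha^2)$ for $\alpha < 1$ and $\frac{\pi^2}{6}(\alpha-1)(5-\alpha)$ for $\alpha > 1$, the symmetric geometric mean factor is assumed to be $\frac{\pi^2}{12}(2+\alpha^2)$ (its leading constant is not legible in this copy), and the optimal quantile values are copied from Table 1.

```python
import math

def v_cc_gm(alpha):
    """Variance factor V for the CC geometric mean estimator (assumed form)."""
    if alpha < 1.0:
        return math.pi ** 2 / 6.0 * (1.0 - alpha ** 2)
    return math.pi ** 2 / 6.0 * (alpha - 1.0) * (5.0 - alpha)

def v_sym_gm(alpha):
    """Variance factor V for the symmetric geometric mean estimator (assumed form)."""
    return math.pi ** 2 / 12.0 * (2.0 + alpha ** 2)

# Variance factors of the optimal quantile estimator, copied from Table 1.
V_OQ = {0.95: 0.01059831, 0.98: 0.001724739, 0.989: 0.0005243589}

rows = []
for alpha in sorted(V_OQ):
    rows.append((alpha, v_sym_gm(alpha), v_cc_gm(alpha), V_OQ[alpha]))

for alpha, v_sym, v_gm, v_oq in rows:
    print(f"alpha={alpha}: symmetric GM={v_sym:.4f}, CC GM={v_gm:.4f}, CC OQ={v_oq:.7f}")
```

Under these assumptions the symmetric factor stays near 2.4 around $\alpha = 1$ while both CC factors shrink toward zero, which is the qualitative behaviour the figure caption describes.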



3.5 Sampling from Maximally-Skewed Stable Random Distributions

The standard procedure for sampling from skewed stable
distributions is based on the Chambers-Mallows-Stuck
method [6]. One first generates an exponential random variable
with mean 1, $W \sim \exp(1)$, and a uniform random variable
$U \sim \mathrm{uniform}\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$, then,

$$Z = \frac{\sin\left(\alpha(U + \rho)\right)}{\left[\cos U \cos(\rho\alpha)\right]^{1/\alpha}} \left[\frac{\cos\left(U - \alpha(U + \rho)\right)}{W}\right]^{\frac{1-\alpha}{\alpha}} \sim S(\alpha, \beta = 1, 1), \tag{15}$$

where $\rho = \frac{\pi}{2}$ when $\alpha < 1$ and $\rho = \frac{\pi}{2} \frac{2-\alpha}{\alpha}$ when $\alpha > 1$.

Sampling from symmetric ($\beta = 0$) stable distributions
uses the same procedure with $\rho = 0$. Thus, the only difference
is the $\cos^{1/\alpha}(\rho\alpha)$ term, which is a constant and can be
removed from the sampling procedure and put back into the
estimates at the end, which in fact also provides better numerical
stability when $\alpha \to 1$. Note that the estimators (7)
and (9) already contain $\cos(\rho\alpha)$ in the numerators. Thus, we
can sample $Z' = Z \cos^{1/\alpha}(\rho\alpha) \sim S(\alpha, \beta, \cos(\rho\alpha))$ instead
of $Z \sim S(\alpha, \beta, 1)$ and evaluate (7) and (9) without $\cos(\rho\alpha)$.
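
The sampling procedure can be sketched as follows for the case $0 < \alpha < 1$, where $\rho = \pi/2$. The sketch deliberately omits $\alpha > 1$, and the function name, seed, and sample size are assumptions. As a sanity check it uses the standard Laplace transform of a positive stable variable in this parameterization, $\mathrm{E}\exp(-sZ) = \exp\left(-s^{\alpha}/\cos(\pi\alpha/2)\right)$, evaluated at $s = 1$.

```python
import numpy as np

def sample_max_skewed_stable(alpha, size, rng):
    """Draw samples from S(alpha, beta=1, 1) for 0 < alpha < 1 (Chambers-Mallows-Stuck)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only covers 0 < alpha < 1")
    rho = np.pi / 2.0
    U = rng.uniform(-np.pi / 2.0, np.pi / 2.0, size)
    W = rng.exponential(1.0, size)
    first = np.sin(alpha * (U + rho)) / (np.cos(U) * np.cos(rho * alpha)) ** (1.0 / alpha)
    second = (np.cos(U - alpha * (U + rho)) / W) ** ((1.0 - alpha) / alpha)
    return first * second

rng = np.random.default_rng(2)
alpha = 0.5
z = sample_max_skewed_stable(alpha, 200_000, rng)

# Compare the empirical Laplace transform at s = 1 with its theoretical value.
laplace_empirical = np.mean(np.exp(-z))
laplace_theory = np.exp(-1.0 / np.cos(np.pi * alpha / 2.0))
```

For $\alpha < 1$ all samples are positive, reflecting the one-sided support of the maximally-skewed distribution. Dropping the constant `np.cos(rho * alpha)` from the denominator would yield the rescaled variable $Z'$ described above.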

4. ESTIMATING ENTROPIES USING CC 

The basic procedure is to first estimate the $\alpha$th frequency
moment $F_{(\alpha)}$ using CC and then compute various entropies
using the estimated $F_{(\alpha)}$. Here we use $\hat{F}_{(\alpha)}$ to denote a
generic estimator of $F_{(\alpha)}$, which could be $\hat{F}_{(\alpha),gm}$, $\hat{F}_{(\alpha),hm}$,
$\hat{F}_{(\alpha),oq}$, or $\hat{F}_{(\alpha),gm,sym}$.

In the following subsections, we analyze the variances and
biases in estimating the Renyi entropy $H_\alpha$, the Tsallis entropy
$T_\alpha$, and the Shannon entropy $H$.

4.1 Renyi Entropy 

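We denote a generic estimator of $H_\alpha$ by $\hat{H}_\alpha$:

$$\hat{H}_\alpha = \frac{1}{1-\alpha} \log \frac{\hat{F}_{(\alpha)}}{F_{(1)}^{\alpha}}, \tag{16}$$

which becomes $\hat{H}_{\alpha,gm}$, $\hat{H}_{\alpha,hm}$, $\hat{H}_{\alpha,oq}$, and $\hat{H}_{\alpha,gm,sym}$,
respectively, when $\hat{F}_{(\alpha)}$ becomes $\hat{F}_{(\alpha),gm}$, $\hat{F}_{(\alpha),hm}$, $\hat{F}_{(\alpha),oq}$, or
$\hat{F}_{(\alpha),gm,sym}$. Since $F_{(1)}$ can be computed exactly and trivially
using a simple counter, we assume it is a constant.

Since $\hat{F}_{(\alpha)}$ is unbiased or asymptotically unbiased, $\hat{H}_\alpha$ is
also asymptotically unbiased. The asymptotic variance of
$\hat{H}_\alpha$ can be computed by Taylor expansions (the so-called
"delta method" in statistics):

$$\mathrm{Var}\left(\hat{H}_\alpha\right) = \frac{1}{(1-\alpha)^2} \mathrm{Var}\left(\log \hat{F}_{(\alpha)}\right) = \frac{1}{(1-\alpha)^2} \frac{1}{F_{(\alpha)}^2} \mathrm{Var}\left(\hat{F}_{(\alpha)}\right) + O\left(\frac{1}{k^2}\right). \tag{17}$$

4.2 Tsallis Entropy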

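The generic estimator for the Tsallis entropy $T_\alpha$ would be

$$\hat{T}_\alpha = \frac{1}{\alpha-1} \left(1 - \frac{\hat{F}_{(\alpha)}}{F_{(1)}^{\alpha}}\right), \tag{18}$$

which is asymptotically unbiased and has variance

$$\mathrm{Var}\left(\hat{T}_\alpha\right) = \frac{1}{(\alpha-1)^2} \frac{1}{F_{(1)}^{2\alpha}} \mathrm{Var}\left(\hat{F}_{(\alpha)}\right) + O\left(\frac{1}{k^2}\right). \tag{19}$$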

4.3 Shannon Entropy 

We use $\hat{H}_{\alpha,R}$ and $\hat{H}_{\alpha,T}$ to denote the estimators for Shannon
entropy using the estimated $\hat{H}_\alpha$ and $\hat{T}_\alpha$, respectively.
The variances remain unchanged, i.e.,

$$\mathrm{Var}\left(\hat{H}_{\alpha,R}\right) = \mathrm{Var}\left(\hat{H}_\alpha\right), \quad \mathrm{Var}\left(\hat{H}_{\alpha,T}\right) = \mathrm{Var}\left(\hat{T}_\alpha\right). \tag{20}$$

However, $\hat{H}_{\alpha,R}$ and $\hat{H}_{\alpha,T}$ are no longer unbiased, even
asymptotically (unless $\alpha \to 1$). The biases would be

$$\mathrm{Bias}\left(\hat{H}_{\alpha,R}\right) = \mathrm{E}\left(\hat{H}_{\alpha,R} - H\right) = H_\alpha - H + O\left(\frac{1}{k}\right), \tag{21}$$

$$\mathrm{Bias}\left(\hat{H}_{\alpha,T}\right) = \mathrm{E}\left(\hat{H}_{\alpha,T} - H\right) = T_\alpha - H + O\left(\frac{1}{k}\right). \tag{22}$$

The $O\left(\frac{1}{k}\right)$ biases arise from the estimation biases in $\hat{H}_\alpha$
and $\hat{T}_\alpha$ and diminish quickly as $k$ increases. In fact, there
are standard statistical procedures to reduce the $O\left(\frac{1}{k}\right)$ bias
to $O\left(\frac{1}{k^2}\right)$. However, the "intrinsic biases," $H_\alpha - H$ and
$T_\alpha - H$, cannot be removed by increasing $k$; they can only
be reduced by letting $\alpha$ be close to 1.

The total error is usually measured by the mean square
error: MSE = Bias$^2$ + Var. Clearly, there is a variance-bias
trade-off in estimating $H$ using $H_\alpha$ or $T_\alpha$. For a particular
data stream, at each sample size $k$, there will be an optimal
$\alpha$ to attain the smallest MSE. The optimal $\alpha$ is data-dependent
and hence some prior knowledge of the data is
needed in order to determine it. The prior knowledge may
be accumulated during the data stream process. Alternatively,
we could seek an estimator that is very accurate near
$\alpha = 1$ to alleviate the variance-bias effect.
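
The trade-off can be made concrete with a small numerical sketch. It combines the intrinsic bias $H_\alpha - H$ with the delta-method variance of the Renyi estimator, $\frac{V(\alpha)}{k(1-\alpha)^2}$, where $V$ is the variance factor of the underlying moment estimator. Everything specific here is an assumption for illustration: the Zipf-like toy distribution, the sample size, and the symmetric geometric mean factor $\frac{\pi^2}{12}(2+\alpha^2)$.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """H_alpha for a probability vector p."""
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    """H for a probability vector p with strictly positive entries."""
    return -np.sum(p * np.log(p))

def v_sym_gm(alpha):
    """Assumed variance factor of the symmetric geometric mean estimator."""
    return np.pi ** 2 / 12.0 * (2.0 + alpha ** 2)

def mse_shannon_via_renyi(p, alpha, k, V):
    """Approximate MSE = (H_alpha - H)^2 + V(alpha) / (k (1 - alpha)^2)."""
    bias = renyi_entropy(p, alpha) - shannon_entropy(p)
    variance = V(alpha) / (k * (1.0 - alpha) ** 2)
    return bias ** 2 + variance

counts = 1.0 / np.arange(1, 1001)  # hypothetical Zipf-like counts
p = counts / counts.sum()
k = 1000
alphas = np.linspace(0.5, 0.999, 500)
mses = np.array([mse_shannon_via_renyi(p, a, k, v_sym_gm) for a in alphas])
best_alpha = alphas[np.argmin(mses)]
```

Because the variance term grows like $1/(1-\alpha)^2$ when $V$ does not vanish at $\alpha = 1$, the minimizing `best_alpha` sits well away from 1 in this toy setting. An estimator whose $V$ shrinks to zero near $\alpha = 1$ would behave very differently.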

5. EXPERIMENTS 

The goal of the experimental study is to demonstrate the
effectiveness of Compressed Counting (CC) for estimating
entropies and to determine a good strategy for estimating
the Shannon entropy. In particular, we focus on the estimation
accuracy and would like to verify the formulas for
(asymptotic) variances in (17) and (19).

5.1 Data 

Since the estimation accuracy is what we are interested in,
we can simply use static data instead of real data streams.
This is because the projected data vector $X = R^{\mathsf{T}} A_t$ is the
same, regardless of whether it is computed at once (i.e., static)
or incrementally (i.e., dynamic). As we have commented,
the processing and storage cost of CC is the same as the cost
of symmetric stable random projections at the same sample
size $k$. Therefore, to compare these two methods, it suffices
to compare their estimation accuracies.

Ten English words are selected from a chunk of Web crawl
data with $D = 2^{16} = 65536$ pages: THE, A, THIS, HAVE,
FUN, FRIDAY, NAME, BUSINESS, RICE, and TWIST.
The words are selected fairly randomly, except that we make
sure they cover a whole range of sparsity, from function
words (e.g., A, THE), to common words (e.g., FRIDAY)
to rare words (e.g., TWIST).

Thus, as summarized in Table 2, our data set consists of
ten vectors of length $D = 65536$ and the entries are the
numbers of word occurrences in each document.

Table 2 indicates that the Renyi entropy $H_\alpha$ provides a
much better approximation to the Shannon entropy $H$ than
the Tsallis entropy $T_\alpha$ does. On the other hand, if the purpose
is to find a summary statistic that is different from
the Shannon entropy (i.e., sensitive to $\alpha$), then the Tsallis
entropy may be more suitable.

5.2 Results 

Table 2: The data set consists of 10 English words selected
from a chunk of $D = 65536$ Web pages, forming
10 vectors of length $D$ whose values are the word occurrences.
The table lists their numbers of non-zeros
(sparsity), the Shannon entropy $H$, the Renyi entropy
$H_\alpha$ and the Tsallis entropy $T_\alpha$ (for $\alpha = 0.95$ and 1.05).

Word        Nonzero    $H$       $H_{0.95}$   $H_{1.05}$   $T_{0.95}$   $T_{1.05}$
TWIST       274        5.4873    5.4962       5.4781       6.3256       4.7919
RICE        490        5.4474    5.4997       5.3937       6.3302       4.7276
FRIDAY      2237       7.0487    7.1039       6.9901       8.5292       5.8993
FUN         3076       7.6519    7.6821       7.6196       9.3660       6.3361
BUSINESS    8284       8.3995    8.4412       8.3566       10.502       6.8305
NAME        9423       8.5162    9.5677       8.4618       10.696       6.8996
HAVE        17522      8.9782    9.0228       8.9335       11.402       7.2050
THIS        27695      9.3893    9.4370       9.3416       12.059       7.4634
A           39063      9.5463    9.5981       9.4950       12.318       7.5592
THE         42754      9.4231    9.4828       9.3641       12.133       7.4775

The results for estimating frequency moments, Renyi entropy,
Tsallis entropy, and Shannon entropy are presented
in the following subsections, in terms of the normalized (i.e.,
relative) mean square errors (MSEs), e.g., $\frac{\mathrm{MSE}\left(\hat{F}_{(\alpha)}\right)}{F_{(\alpha)}^2}$, $\frac{\mathrm{MSE}\left(\hat{H}_\alpha\right)}{H_\alpha^2}$,
etc. After normalization, we observe that the results are
quite similar across different words. To avoid boring the
readers, not all words are selected for the presentation. However,
we provide the experimental results for all 10 words
in estimating Shannon entropy.
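
For concreteness, the normalized MSE and its decomposition into squared bias and variance can be computed from repeated estimates as in the sketch below. The helper names and the four toy estimates are hypothetical; they are not data from the experiments.

```python
import numpy as np

def normalized_mse(estimates, truth):
    """Relative MSE: mean squared error divided by the squared true value."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates - truth) ** 2) / truth ** 2

def bias_and_variance(estimates, truth):
    """Return (bias, variance), using the population variance so MSE = bias^2 + var."""
    estimates = np.asarray(estimates, dtype=float)
    return estimates.mean() - truth, estimates.var()

# Hypothetical repeated estimates of a quantity whose true value is 10.
estimates = [9.0, 11.0, 10.0, 12.0]
truth = 10.0
rel_mse = normalized_mse(estimates, truth)
bias, var = bias_and_variance(estimates, truth)
```

Here the raw MSE is 1.5, which splits into a squared bias of 0.25 and a variance of 1.25, giving a normalized MSE of 0.015.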

In our experiments, the sample size $k$ ranges from 20 to
$10^4$. We choose $0.8 \le \alpha \le 0.989$ and $1.011 \le \alpha \le 1.2$. This
is because [24] only provided the optimal quantile estimator
for $\alpha \ge 1.011$ and $\alpha \le 0.989$. For the geometric mean and
harmonic mean estimators, we actually had no problem
using (e.g.,) $\alpha = 1 - 10^{-4}$ or $\alpha = 1 + 10^{-4}$.

5.2.1 Estimating Frequency Moments

Figure 2, Figure 3, and Figure 4 provide the MSEs for
estimating the $\alpha$th frequency moments, $F_{(\alpha)}$, for TWIST,
RICE, and FRIDAY, respectively.

• The errors of the three estimators for CC decrease (to
zero, potentially) as $\alpha \to 1$, while the errors of symmetric
stable random projections do not vary much near
$\alpha = 1$. The improvement of CC is enormous as $\alpha \to 1$.
For example, when $k = 20$ and $\alpha = 0.989$, the MSE of
CC using the optimal quantile estimator is about $10^{-5}$
while the MSE of symmetric stable random projections
is about $10^{-1}$, a 10000-fold error reduction.

• The optimal quantile estimator $\hat{F}_{(\alpha),oq}$ is in general
more accurate than the geometric mean and harmonic
mean estimators near $\alpha = 1$. However, for small $k$
(e.g., 20) and $\alpha > 1$, $\hat{F}_{(\alpha),oq}$ exhibits some bad behaviors,
which disappear when $k \ge 50$ (or even $k \ge 30$).

• The theoretical asymptotic variances in (8), (10), (14),
and Table 1 are accurate.




Figure 2: Frequency moments, $F_{(\alpha)}$, for TWIST.
Solid curves are empirical mean square errors
(MSEs) and dashed curves are theoretical asymptotic
variances in (8), (10), (14), and Table 1.
"F,gm" stands for the geometric mean estimator
$\hat{F}_{(\alpha),gm}$ (7), "F,hm" for the harmonic mean estimator
$\hat{F}_{(\alpha),hm}$ (9), "F,oq" for the optimal quantile estimator
$\hat{F}_{(\alpha),oq}$ (11), and "F,gm,sym" for the geometric
mean estimator $\hat{F}_{(\alpha),gm,sym}$ (13) in symmetric stable
random projections.

Figure 3: Frequency moments, $F_{(\alpha)}$, for RICE. See
the caption of Figure 2 for more explanations.

5.2.2 Estimating Renyi Entropy 

Figure 5 plots the MSEs for estimating the Renyi entropy
for TWIST, with the curves for $k = 20$ removed. The figure
illustrates that: (1) CC improves on symmetric stable random
projections enormously when $\alpha \to 1$; (2) The generic variance
formula (17) is accurate.



5.2.3 Estimating Tsallis Entropy 

Figure 6 plots the MSEs for estimating the Tsallis entropy
for RICE, illustrating that: (1) CC improves on symmetric
stable random projections enormously when $\alpha \to 1$; (2) the
generic variance formula (19) is accurate.



5.2.4 Estimating Shannon Entropy from Renyi Entropy

Figure 7 illustrates the MSEs from estimating the Shannon
entropy using the Renyi entropy, for RICE.

• Using symmetric stable random projections with
$\alpha = 1 + \delta$ and very small $|\delta|$ is not a good strategy and is
not practically feasible because the required sample size
is enormous. For example, using $|\delta| \approx 0.01$, we need
k = 10000 in order to achieve a relative MSE of 1%.

• There is clearly a variance-bias trade-off, especially
for the geometric mean and harmonic mean estimators.
That is, for each k, there is an "optimal" $\alpha$ which
achieves the smallest MSE.

• Using the optimal quantile estimator does not show
a strong variance-bias trade-off, because it has very
small variance near $\alpha = 1$ and its MSEs are mainly
dominated by the (intrinsic) biases, $H_\alpha - H$.

Figure 4: Frequency moments, $F_{(\alpha)}$, for FRIDAY.


• The improvement of CC over symmetric stable random
projections is very large when $\alpha$ is close to 1. When $\alpha$ is
away from 1, the improvement becomes less obvious
because the MSEs are dominated by the biases.

• Using the optimal quantile estimator with $\alpha$ very close
to 1 (preferably $\alpha < 1$) is our recommended procedure
for estimating Shannon entropy from Renyi entropy.
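The intrinsic biases mentioned in the list above, $H_\alpha - H$ for the Renyi entropy and $T_\alpha - H$ for the Tsallis entropy, can be computed exactly for a toy distribution. The sketch below uses the standard definitions of both entropies in terms of the $\alpha$th frequency moment; the count vector and the function names are made up for illustration.

```python
import math


def frequency_moment(counts, alpha):
    """alpha-th frequency moment: sum of counts raised to the power alpha."""
    return sum(c ** alpha for c in counts if c > 0)


def shannon(counts):
    """Shannon entropy (in nats) of the empirical distribution."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def renyi(counts, alpha):
    """Renyi entropy: log(F_alpha / F_1^alpha) / (1 - alpha)."""
    n = sum(counts)
    return math.log(frequency_moment(counts, alpha) / n ** alpha) / (1 - alpha)


def tsallis(counts, alpha):
    """Tsallis entropy: (1 - F_alpha / F_1^alpha) / (alpha - 1)."""
    n = sum(counts)
    return (1 - frequency_moment(counts, alpha) / n ** alpha) / (alpha - 1)


counts = [50, 30, 10, 5, 3, 1, 1]  # made-up item counts
H = shannon(counts)

for alpha in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):
    renyi_bias = renyi(counts, alpha) - H
    tsallis_bias = tsallis(counts, alpha) - H
    print(f"alpha={alpha}: Renyi bias={renyi_bias:+.5f}, "
          f"Tsallis bias={tsallis_bias:+.5f}")
```

Both biases vanish as $\alpha \to 1$, and at any fixed $\alpha$ the Tsallis bias is at least as large in magnitude as the Renyi bias, which follows from the inequality $\log x \le x - 1$.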

For fixed $\alpha$ and k, we can see that CC improves on symmetric
stable random projections enormously when $\alpha \to 1$. If
we follow the theoretical suggestion of [10, 11] by using (e.g.)
$\alpha = 1 + 10^{-4}$, then the improvement of CC over symmetric
stable random projections will be enormous.

Figure 5: Renyi entropy, $H_\alpha$, for TWIST. The theoretical
variances (dashed) are computed from (17).

Figure 6: Tsallis entropy, $T_\alpha$, for RICE. The theoretical
variances (dashed) are computed from (19).

Figure 7: Shannon entropy, $H$, estimated from
Renyi entropy, $H_\alpha$, for RICE. Curves are the mean
square errors (MSEs).

As a practical recommendation, we do not suggest letting
$\alpha$ be too close to 1 when using symmetric stable random
projections. Instead, one should take advantage of the
variance-bias trade-off by using $\alpha$ away from 1. There will be an
"optimal" $\alpha$ that attains the smallest mean square error (MSE)
at each k.
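The variance-bias trade-off described above can be sketched numerically. The example below models the MSE of a Renyi-based Shannon entropy estimate as squared intrinsic bias plus a first-order (delta-method) variance term. The variance factor $(\pi^2/12)(\alpha^2 + 2)$ for the geometric mean estimator in symmetric stable random projections is an assumption recalled from the literature rather than a formula restated in this section, and the Zipf-like distribution, the grid, and the function names are hypothetical.

```python
import math


def shannon(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def renyi(p, alpha):
    """Renyi entropy of a probability vector, alpha != 1."""
    return math.log(sum(q ** alpha for q in p if q > 0)) / (1 - alpha)


def approx_mse(p, alpha, k):
    """Approximate MSE = bias^2 + variance for estimating Shannon entropy
    through the Renyi entropy with k symmetric stable projections."""
    bias = renyi(p, alpha) - shannon(p)
    # Assumed relative variance factor of the moment estimator (per sample).
    var_factor = (math.pi ** 2 / 12) * (alpha ** 2 + 2)
    # Delta method: dividing by (1 - alpha) inflates the variance near alpha = 1.
    variance = var_factor / (k * (1 - alpha) ** 2)
    return bias ** 2 + variance


def best_alpha(p, k, grid):
    """Grid value of alpha that minimizes the approximate MSE at sample size k."""
    return min(grid, key=lambda a: approx_mse(p, a, k))


# Hypothetical Zipf-like distribution over 50 items.
weights = [1 / (i + 1) for i in range(50)]
total = sum(weights)
p = [w / total for w in weights]

grid = [i / 200 for i in range(10, 200)]  # alpha from 0.05 to 0.995

alpha_small_k = best_alpha(p, 100, grid)
alpha_large_k = best_alpha(p, 10000, grid)
print(alpha_small_k, alpha_large_k)
```

Under this model the minimizing $\alpha$ moves toward 1 as k grows, because a larger sample can afford the variance inflation that comes with a smaller bias.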

As illustrated in Figure 7, CC is not affected much by
the variance-bias trade-off and it is preferable to choose $\alpha$
close to 1 when using the optimal quantile estimator. Therefore,
we will present the comparisons mainly in terms of the
minimum MSEs (i.e., best achievable performance), which
we believe actually heavily favors symmetric stable random
projections.



Figure 8 presents the minimum MSEs for all 10 words:



• The optimal quantile estimator is the most accurate.
For example, using k = 20, the relative MSE is
less than 1% (or even 0.1%), which may already be
accurate enough for some applications.

• For every k, CC reduces the (minimum) MSE roughly
by 20- to 50-fold, compared to symmetric stable random
projections. This is comparing the curves in the
vertical direction.

• To achieve the same accuracy as symmetric stable ran- 
dom projections, CC requires a much smaller k, a re- 
duction by about 50-fold (using the optimal quantile 
estimator). This is comparing the curves in the hori- 
zontal direction. 

• The results are quite similar for all 10 words. While
presenting all 10 words is repetitive, the results strongly
suggest that the performance of CC and its improvement
over symmetric stable random projections
should hold universally, not just for these 10 words.
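The two ways of reading the curves in the list above (vertically, as an MSE ratio at a fixed k, and horizontally, as a ratio of sample sizes at a fixed MSE) can be made concrete with a short sketch. The two curves below are synthetic placeholders of the form c/k, not measurements from the figures, and `k_needed` is a hypothetical helper that interpolates linearly on log-log axes.

```python
import math


def k_needed(ks, mses, target):
    """Sample size at which a decreasing MSE curve first reaches `target`,
    using linear interpolation on log-log axes; None if never reached."""
    if mses[0] <= target:
        return float(ks[0])
    for i in range(len(ks) - 1):
        k0, k1, m0, m1 = ks[i], ks[i + 1], mses[i], mses[i + 1]
        if m0 > target >= m1:
            t = (math.log(m0) - math.log(target)) / (math.log(m0) - math.log(m1))
            return math.exp(math.log(k0) + t * (math.log(k1) - math.log(k0)))
    return None


ks = [20, 50, 100, 1000, 10000]
mse_sym = [5.0 / k for k in ks]  # synthetic baseline curve
mse_cc = [0.1 / k for k in ks]   # synthetic curve, 50 times lower

# Vertical comparison: MSE ratio at each k.
vertical = [s / c for s, c in zip(mse_sym, mse_cc)]

# Horizontal comparison: how many times more samples the baseline needs
# to match the accuracy of the lower curve at its smallest k.
horizontal = k_needed(ks, mse_sym, mse_cc[0]) / ks[0]
print(vertical, horizontal)
```

When both curves decay like 1/k, as in this synthetic case, the vertical and horizontal ratios coincide; when a curve flattens because of bias, the horizontal ratio can be much larger or the target may never be reached.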

5.2.5 Estimating Shannon Entropy from Tsallis Entropy

Figure illustrates the MSEs from estimating Shannon 
entropy using Tsallis entropy, for RICE: 

• Using symmetric stable random projections with
$\alpha = 1 + \delta$ and very small $|\delta|$ is not a good strategy and is
not practically feasible. For example, when $|\delta| \approx 0.01$,
using k = 10000 can only achieve a relative MSE of
10%.

• The effect of the variance-bias trade-off for the geometric
mean and harmonic mean estimators is even more
significant, because the (intrinsic) bias $T_\alpha - H$ is large,
as reported in Table 2.

• The MSE of the optimal quantile estimator is not
affected much by k, because its variance is negligible
compared to the (intrinsic) bias.

Figure 10 presents the minimum MSEs for all 10 words:

• The optimal quantile estimator is the most accurate.
With k = 20, the relative MSE is less than 1%
(or even 0.1%).

• When $k < 10^3$, using the optimal quantile estimator,
CC reduces the minimum MSEs by roughly 20- to
50-fold, compared to symmetric stable random projections.

• Even with $k = 10^4$, symmetric stable random projections
cannot achieve the same accuracy as CC using
the optimal quantile estimator with only k = 20.

Again, using the optimal quantile estimator with
$\alpha \approx 0.98 \sim 0.99$ would be our recommended procedure for
estimating Shannon entropy from Tsallis entropy.



Figure 8: Shannon entropy, $H$, estimated from
Renyi entropy, $H_\alpha$, for 10 words in Table 2. Curves
are the minimum MSEs at each k.




Figure 10: Shannon entropy, $H$, estimated from
Tsallis entropy, $T_\alpha$, for 10 words in Table 2. Curves
are the minimum MSEs at each k.



6. CONCLUSION 

Network data and Web search data are naturally dynamic 
and can be viewed as data streams. The entropy is an ex- 
tremely useful summary statistic and has numerous appli- 
cations, for example, anomaly detection in Web mining and 
network diagnosis. 

Efficiently and accurately computing the entropy of ultra-large
and frequently updated data streams, in one pass, is
an active topic of research. A recent trend is to use the $\alpha$th
frequency moments with $\alpha \approx 1$ to approximate the entropy.
For example, [10, 11] proposed using the $\alpha = 1 + \delta$ frequency
moments with very small $|\delta|$ (e.g., $10^{-4}$ or smaller).

For estimating the $\alpha$th frequency moments, the recently
proposed Compressed Counting (CC) dramatically improves
the standard data stream algorithm based on symmetric stable
random projections, especially when $\alpha \approx 1$. However, it
had never been empirically evaluated before this work.

We experimented with CC to approximate the Renyi en- 
tropy, the Tsallis entropy, and the Shannon entropy. Some 
theoretical analysis on the biases and variances was pro- 
vided. Extensive empirical studies based on some Web crawl 
data were conducted. 

Based on the theoretical and empirical results, important 
conclusions can be drawn: 

• Compressed Counting (CC) is numerically stable and
is capable of providing highly accurate estimates of
the $\alpha$th frequency moments. When $\alpha$ is close to 1,
the improvements of CC over symmetric stable random
projections in estimating frequency moments are
enormous; in fact, the improvements tend to "infinity"
when $\alpha \to 1$.

• When $\alpha$ is close to 1, the optimal quantile estimator for
CC is more accurate than the geometric mean and harmonic
mean estimators, except when $\alpha > 1$ and the
sample size k is very small (e.g., k < 20).

• Approximating the Shannon entropy using symmetric
stable random projections with $\alpha = 1 + \delta$ and very
small $|\delta|$ does not appear to be a practical algorithm. When
we do need to use symmetric stable random projections,
we should take advantage of the variance-bias trade-off
by using $\alpha$ away from 1 to achieve smaller mean
square errors (MSEs).

• CC is able to provide highly accurate estimates of the
Shannon entropy using either the Renyi entropy or the
Tsallis entropy. In terms of the best achievable MSEs,
the improvements over symmetric stable random
projections can be about 20- to 50-fold.

• When estimating Shannon entropy from Renyi entropy, 
in order to reach the same accuracy as CC, symmetric 
stable random projections would need about 50 times 
more samples than CC. When estimating Shannon en- 
tropy from Tsallis entropy, symmetric stable random 
projections could not reach the same accuracy as CC 
even with 500 times more samples. 

• The Renyi entropy provides a better tool for estimating 
the Shannon entropy than the Tsallis entropy does. 

• Our recommended procedure for estimating the Shannon
entropy is to use CC with the optimal quantile
estimator and $\alpha < 1$ close to 1 (e.g., $0.98 \sim 0.99$).



• Since CC only needs a very small sample to achieve 
a good accuracy, the processing time of CC will be 
much reduced, compared to symmetric stable random 
projections, if the same level of accuracy is desired. 

The technique of estimating Shannon entropy using symmetric
stable random projections has been applied with some
success in practical applications, such as network anomaly
detection and diagnosis [35]. One major issue reported in
[35] (also [8]) is that the required sample size using symmetric
stable random projections could be prohibitive for their
real-time applications. Since CC can dramatically reduce
the required sample size, we are optimistic that using Compressed
Counting for estimating Shannon entropy will be
highly practical and beneficial to real-world Web/network/data
stream problems.